LexicHash: sequence similarity estimation via lexicographic comparison of hashes

Abstract Motivation Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. Results In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how “lexicographically similar” the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision–recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n2) scaling associated with pairwise similarity search. Availability and implementation LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.


Figure S2
. Typical precision-recall curve (PRC) and receiver operator characteristic (ROC) curve using 500 hash functions for an example NCTC dataset (Streptococcus sp.).Though LexicHash outperforms MinHash slightly, AUCs for both LexicHash and MinHash ( = 12) are quite high.More interesting and, we argue, more salient results are at 100 hash functions, since LexicHash is potentially both more computationally efficient and more accurate, and because LexicHash exhibits sufficient performance in a practical setting at a smaller number of hash functions.

Figure S3
. A comparison of runtimes for read-pair aggregation only (i.e., post-sketching) between LexicHash and MinHash (k=16).A linear fit for LexicHash has an  != 0.989.A quadratic fit for MinHash has an  != 0.981, which is a significantly better fit than a linear one, for which  != 0.908.MinHash was unable to find  = 5 read-pairs with a non-zero number of min-hash collisions for  = 10,000 reads, hence the missing data.

NCTC Datasets
We used the first 40 NCTC datasets for our experiments.In each dataset we only considered roughly 1000 reads, since our main aim with these datasets was to evaluate the performance of LexicHash, rather than discover all overlaps in a full dataset for practical use.The reads were picked such that each read had at least 5 significant overlaps (according to Daligner), thus allowing us to compare the performance without too many false positives.Daligner was used to generate the ground truths of these alignments for a fixed threshold  = 0.2.To obtain the ROC curves and PRC curves for MinHash, we considered different thresholds on the calculated Jaccard similarity.For LexicHash we considered different thresholds on the maximum match length between two reads.Disclosure: NCTC datasets were originally preprocessed for the Spectral Jaccard Similarity paper Baharav et al. (2020).The above text was modified from the corresponding Supplemental Information.

Plasmodium falciparum Dataset
The P. falciparum dataset was generated in a similar way as the NCTC datasets.We used minimap2 version 2.17 to find all sets of overlapping read pairs (with mapping quality 60).Then, we pruned the read set to 2000 reads by taking reads with a degree ≥ 40 from the overlap graph.The resulting reads after this process had between 5 and 20 overlaps among the 2000 reads.Minimap2 was used the generate ground truths, and ROCs and PRCs were generated in an identical manner to that of the NCTC datasets.

Escherichia coli Dataset
The E. coli dataset (accession SRR11434956) used to benchmark the runtime comparison between LexicHash and MinHash was sequenced by a PacBio RS II sequencer.The original dataset consisted of 144,238 reads.After preprocessing to exclude reads with > 30% error, the resulting full dataset consists of 113,378 reads.Filtering was done using fastq-filter (https://github.com/LUMC/fastq-filter). Memory and cpu-time were determined using a nonmultiprocessed version of LexicHash.The memory usage increases roughly linearly with the number of CPUs used.

Details of minimizer method
The hash function used in the standard minimizer method is identical to that of MinHash (Alg 2 of Section C), using a single hash function for all reads.For chaining, we used 20 bases for the maximum difference in position differences between consecutive collinear hits in a chain (i.e., the  used in Alg 4 of the minimap paper).For example, when computing a chain between sequences  ) and  !, suppose a hit occurs at position 200 in  ) and 400 in  ! .If the next hit occurs at 300 in  ) and 510 in  !, then the chain is extended, but not if the hit is at position 600 in  !, since the difference increases more than 20 (from 200 to 300).

Details of the comparison to strobemers
We note here that when increasing the number of masks used in LexicHash, the performance does not necessarily improve significantly.This is because the basic version of LexicHash simply uses the maximum match length as the similarity score, which can predict spurious matches due to short genomic repeats (of e.g., 40 bp).Thus, to use an increasing number of hash function in the comparison to strobemers and minimizers, we take the average of the top  largest matches as the similarity score, where  = 2, 3, 4, 8 for  *'+* = 100, 200, 500, 1000, respectively.This has the effect of lowering the score of these false positive read-pairs, while maintaining large scores for real overlaps.

C. Discussion on mask generation
In this section, we explore the effect that different mask generating processes has on the LexicHash score (i.e., match length).For each dataset and mask type, we evaluate the binary hypothesis test (BHT) between deciding whether a pair of reads is "overlapping" or "nonoverlapping" with equal priors, based on the score for a single mask.Particularly, we calculate the probability of error of the BHT,  " =  ,-./ Pr( | ) +  &,&,-./Pr( | ), which happens to be half the area under the minimum of the two overlapping and nonoverlapping PMFs (i.e., the black lines in the figures) since the priors are equal.
The types of mask generation processes evaluated are shown below.In the first experiment, we determine whether generating masks using a deterministic process that evenly "spreads out" the masks is better than randomly generating them by default.In the second, we evaluate whether we can achieve greater statistical power by tailoring the mask generation processes to the dataset at hand.Specifically, we compare generating masks uniformly at random with a distribution that "looks like" the dataset, or like the opposite of the dataset.Uniform: Masks are generated i.i.d.uniform on {A,T,G,C}.This is the default used in the current version of LexicHash.Evenly Distributed: Given a choice for the number of masks, , the starts of the masks use each possible combination of bases, the initial patterns are repeated to a length of  max .For example, if  = 256 = 4 3 (which is used to create the plots in Fig S5 ), the masks are  GGGGGGGG ,  GGGGGGGG , … ,  GGGGGGGG .Biased: Masks are generated i.i.d. with probabilities according to the frequencies of each base in the dataset.Opposite Biased: Mask are generated according to the reverse-complement of the base frequencies.
The datasets evaluated are: I.I.D: 10,000 overlapping and 10,000 nonoverlapping read pairs are generated using a i.i.d.distribution with 36% G+C probability, and an 8% probability of substitution error is added for each read.
NCTC1080 and NCTC3761: These real datasets have G+C contents of 39.1% and 33.5%, respectively.

Figure S5
. Probability mass functions (PMFs) of the LexicHash score for a single mask, for three different types of mask-generating distributions, across three datasets.For the i.i.d.generated dataset, the difference in error probabilities between the uniform and evenly distributed methods is only 0.1%, which is well within the margin of error.However, the same experiment on the real NCTC datasets shows a significant improvement of around 1% when the masks are generated evenly.This experiment is quite insightful and useful since using a deterministic process to generate the masks decreases the complexity compared to using randomization.Notice that for the i.i.d.dataset, the biased mask outperforms the uniform mask with a  " of 41.4%, and the opposite biased underperforms.Interestingly for the real NCTC datasets, while the biased mask still outperforms the default, the opposite biased mask performed the best, which was the case for the majority of the NCTC datasets tested.Clearly, a model with only substitution errors is insufficient to study the role of the generating distribution of LexicHash masks.
Originally, we had hypothesized that the opposite biased mask would perform the best of the three, since it would intrinsically "target" less common k-mers, which might theoretically be more important than common ones.Conversely, one possible explanation for why the biased mask performs best for the i.i.d.dataset is that it serves as a sort of "matched filter" for detecting k-mers that are likely to be in both sequences for overlapping reads.(3,8,25,50) for StrobeMap, and (, ) = (16,10) for minimizers.We can see that the CPUtime increases roughly linearly with the number of hash functions, as expected.As empirically verified in Fig. 8 in the main paper (as well as for other datasets), the performance of LexicHash and strobemers are similar at around 200 hashes, and coincidentally, the CPU-times are also roughly equal at that number of hashes.

E. Algorithms used in the paper
Building the prefix tree Calculating k-mer hash value for MinHash

Figure S6 .
Figure S6.Probability mass functions (PMFs) of the LexicHash score for a single mask, for three types of mask-generating distributions, across three datasets.

Figure S7 .
Figure S7.CPU-time and peak memory usage using LexicHash ( $'( = 32,  $%& = 16) for the E. coli dataset with varying sizes after downsampling.The runtime increases roughly linearly with the number of reads due to the efficient similarity estimation implemented in LexicHash.Naturally, the memory usage increases roughly quadratically with the number of reads, since the true number of overlapping read pairs increases quadratically.Note that if the dataset size was further increased, the quadratic scaling would begin to impact the runtime as well if the default LexicHash is used.

Figure S8 .
Figure S8.CPU-time and peak memory usage using LexicHash on the NCTC1080 dataset for various  $'( and  $%& values.Notice that the memory steadily increases since Python uses more bytes to store larger integers larger (with thresholds at 2 45 and 2 65 ).Additionally, if LexicHash were implemented in C++, for example, we could expect to see memory and runtime jumps above  $'( = 32, as this corresponds to 64 bits, the typical integer size of modern processors.